Perception is one of the key modules of an autonomous driving system and has made great progress recently. However, the limited capability of an individual vehicle creates a bottleneck for perception performance. To break through the limits of individual perception, collaborative perception has been proposed, enabling vehicles to share information and perceive environments beyond line-of-sight and field-of-view. In this paper, we review related work on this promising collaborative perception technology, including introducing the basic concepts, generalizing the collaboration modes, and summarizing the key ingredients and applications of collaborative perception. Finally, we discuss the open challenges and issues of this research field and give some potential directions.
Existing gait recognition methods either directly establish Global Feature Representations (GFRs) from raw gait sequences or generate Local Feature Representations (LFRs) from several local parts. However, as the network gets deeper, GFRs tend to neglect local details of human postures. Although LFRs allow the network to focus on the detailed posture information of each local region, they neglect the relations among different local parts and thus exploit only limited local information from a few specific regions. To address these issues, we propose a global-local based gait recognition network, named GaitGL, to generate more discriminative feature representations. Specifically, a novel Global and Local Convolutional Layer (GLCL) is developed to take full advantage of both global visual information and local region details in each layer. GLCL is a dual-branch structure consisting of a GFR extractor and a mask-based LFR extractor. The GFR extractor aims to extract contextual information, e.g., the relationships among various body parts, while the mask-based LFR extractor is presented to exploit the detailed posture changes of local regions. In addition, we introduce a novel mask-based strategy to improve the local feature extraction capability. Specifically, we design pairs of complementary masks to randomly occlude feature maps, and then train our mask-based LFR extractor on the various occluded feature maps. In this manner, the LFR extractor learns to fully exploit local information. Extensive experiments demonstrate that GaitGL performs better than state-of-the-art gait recognition methods. The average rank-1 accuracy on CASIA-B, OU-MVLP, GREW, and Gait3D is 93.6%, 98.7%, 68.0%, and 63.8%, respectively, significantly outperforming the competing methods. The proposed method won first place in two competitions: HID 2020 and HID 2021.
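To make the complementary-mask strategy concrete, here is a minimal PyTorch sketch, not the authors' code: a random horizontal band of the feature map is zeroed, the complement keeps only that band, and one LFR extractor sees both views. The band ratio, the stand-in extractor, and all names are our own assumptions.

```python
import torch
import torch.nn as nn

class ComplementaryMask(nn.Module):
    """Zero out a random horizontal band of a feature map and return both
    the occluded map and its complement (a sketch of the mask-based
    strategy described in the abstract, not the authors' code)."""

    def __init__(self, band_ratio: float = 0.3):
        super().__init__()
        self.band_ratio = band_ratio

    def forward(self, feat: torch.Tensor):
        # feat: (N, C, T, H, W) gait feature maps
        h = feat.size(-2)
        band = max(1, int(h * self.band_ratio))
        start = torch.randint(0, h - band + 1, (1,)).item()
        mask = torch.ones_like(feat)
        mask[..., start:start + band, :] = 0.0
        return feat * mask, feat * (1.0 - mask)  # complementary pair

# Hypothetical usage: train one LFR extractor on both occluded views so it
# cannot rely on any single local region.
lfr = nn.Conv3d(64, 64, kernel_size=3, padding=1)  # stand-in LFR extractor
x = torch.randn(2, 64, 30, 16, 11)                 # (N, C, T, H, W)
masked, complement = ComplementaryMask()(x)
out = lfr(masked) + lfr(complement)
```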
Collaborative perception has recently shown great potential for improving perception capability over single-agent perception. Existing collaborative perception methods usually assume an ideal communication environment. In practice, however, the communication system inevitably suffers from latency issues, causing potential performance degradation and high risks in safety-critical applications such as autonomous driving. To mitigate the effect of the inevitable communication latency, from a machine learning perspective we present the first latency-aware collaborative perception system, which actively adapts asynchronous perceptual features from multiple agents to the same timestamp, promoting the robustness and effectiveness of collaboration. To achieve such feature-level synchronization, we propose a novel latency-compensation module, termed SyncNet, which leverages feature-attention symbiotic estimation and time-modulation techniques. Experimental results show that our method outperforms the state-of-the-art collaborative perception method by 15.6% on the latest collaborative perception dataset, V2X-SIM.
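As a rough illustration of feature-level latency compensation, the following toy PyTorch sketch predicts a feature at the ego timestamp from a short history of delayed features and their delays. It is a stand-in for, not a reproduction of, the SyncNet module; the convolutional predictor and all tensor layouts are our assumptions.

```python
import torch
import torch.nn as nn

class LatencyCompensation(nn.Module):
    """Toy latency compensation: given a short history of delayed BEV
    features from another agent, estimate the feature at the ego
    timestamp. Our own stand-in for the module in the abstract."""

    def __init__(self, channels: int, history: int = 3):
        super().__init__()
        # Fuse the feature history with per-frame delay maps.
        self.predict = nn.Sequential(
            nn.Conv2d(channels * history + history, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, feats: torch.Tensor, delays: torch.Tensor):
        # feats: (N, K, C, H, W) delayed features; delays: (N, K) seconds
        n, k, c, h, w = feats.shape
        delay_maps = delays.view(n, k, 1, 1).expand(n, k, h, w)
        x = torch.cat([feats.reshape(n, k * c, h, w), delay_maps], dim=1)
        return self.predict(x)  # estimated feature at the ego timestamp

comp = LatencyCompensation(channels=64)
feats = torch.randn(1, 3, 64, 32, 32)
delays = torch.tensor([[0.3, 0.2, 0.1]])
synced = comp(feats, delays)  # (1, 64, 32, 32)
```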
Reliable and stable 6D pose estimation of uncooperative space objects plays an essential role in on-orbit servicing and debris removal missions. Considering that the pose estimator is sensitive to background interference, this paper proposes a counterfactual analysis framework, named CA-SpaceNet, to achieve robust 6D pose estimation of spaceborne targets under complicated backgrounds. Specifically, a conventional method is adopted to extract the features of the whole image in the factual case. In the counterfactual case, a non-existing image containing only the background, without the target, is imagined. Counterfactual analysis reduces the side effects caused by background interference, leading to unbiased final predictions. In addition, we carry out low-bit-width quantization of CA-SpaceNet and deploy part of the framework onto a processing-in-memory (PIM) accelerator on an FPGA. Qualitative and quantitative results demonstrate the effectiveness and efficiency of the proposed method. To the best of our knowledge, this paper is the first to apply causal inference and network quantization to the 6D pose estimation of spaceborne targets. The code is available at https://github.com/shunli-wang/ca-spacenet.
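The debiasing step can be pictured with a small sketch, assuming the counterfactual prediction is simply subtracted from the factual one; the blank-out "imagination", the toy backbone, and the 6-dimensional pose head below are our own simplifications, not the CA-SpaceNet implementation.

```python
import torch
import torch.nn as nn

class CounterfactualDebias(nn.Module):
    """Sketch of the counterfactual-analysis idea: subtract the prediction
    made from an imagined target-free (background-only) input from the
    factual prediction, so background-induced bias cancels out."""

    def __init__(self, backbone: nn.Module, head: nn.Module):
        super().__init__()
        self.backbone = backbone
        self.head = head

    def forward(self, image: torch.Tensor, target_mask: torch.Tensor):
        factual = self.head(self.backbone(image))
        # "Imagine" the counterfactual scene by blanking the target region.
        background_only = image * (1.0 - target_mask)
        counterfactual = self.head(self.backbone(background_only))
        return factual - counterfactual  # debiased pose logits

backbone = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                         nn.AdaptiveAvgPool2d(1), nn.Flatten())
head = nn.Linear(16, 6)                      # toy 6-DoF pose regressor
model = CounterfactualDebias(backbone, head)
img = torch.randn(1, 3, 128, 128)
mask = torch.zeros(1, 1, 128, 128)
mask[..., 32:96, 32:96] = 1.0                # assumed target region
pose = model(img, mask)                      # (1, 6)
```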
In recent years, assessing the action quality of videos has attracted growing attention in the computer vision community and human-computer interaction. Most existing methods tackle this problem by directly migrating a model from an action recognition task, which ignores intrinsic differences within the feature map, such as foreground and background information. To address this issue, we propose a Tube Self-Attention Network (TSA-Net) for action quality assessment (AQA). Specifically, we introduce a single-object tracker into AQA and propose the tube self-attention module (TSA), which can efficiently generate rich spatio-temporal contextual information by adopting sparse feature interactions. The TSA module is embedded into existing video networks to form TSA-Net. Overall, our TSA-Net has the following merits: 1) high computational efficiency, 2) high flexibility, and 3) state-of-the-art performance. Extensive experiments are conducted on popular action quality assessment datasets, including AQA-7 and MTL-AQA. Furthermore, a dataset named FR-FS is proposed to explore basic action assessment in figure skating scenes.
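A minimal sketch of the tube self-attention idea, under our own assumptions: attention is computed only among spatio-temporal positions inside the tracked box tube, which realizes the sparse feature interaction mentioned above. This is not the authors' implementation.

```python
import torch
import torch.nn as nn

def tube_self_attention(feat: torch.Tensor, tube_mask: torch.Tensor,
                        attn: nn.MultiheadAttention) -> torch.Tensor:
    """Run self-attention only among spatio-temporal positions inside a
    tracked box tube; positions outside the tube are left unchanged.
    feat: (N, C, T, H, W); tube_mask: (N, T, H, W) with 1 inside the tube."""
    n, c, t, h, w = feat.shape
    tokens = feat.permute(0, 2, 3, 4, 1).reshape(n, t * h * w, c)
    keep = tube_mask.reshape(n, t * h * w).bool()
    out = tokens.clone()
    for i in range(n):  # loop for clarity; a real impl would batch this
        inside = tokens[i, keep[i]].unsqueeze(0)   # (1, M, C) in-tube tokens
        upd, _ = attn(inside, inside, inside)      # attention within the tube
        out[i, keep[i]] = upd.squeeze(0)
    return out.reshape(n, t, h, w, c).permute(0, 4, 1, 2, 3)

attn = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
feat = torch.randn(1, 64, 8, 14, 14)
tube = torch.zeros(1, 8, 14, 14)
tube[:, :, 4:10, 4:10] = 1.0                       # toy tracked tube
out = tube_self_attention(feat, tube, attn)        # (1, 64, 8, 14, 14)
```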
To promote a better performance-bandwidth trade-off for multi-agent perception, we propose a novel distilled collaboration graph (DiscoGraph) to model trainable, pose-aware, and adaptive collaboration among agents. Our key novelties lie in two aspects. First, we propose a teacher-student framework to train the DiscoGraph via knowledge distillation. The teacher model employs early collaboration with holistic-view inputs; the student model is based on intermediate collaboration with single-view inputs. Our framework trains the DiscoGraph by constraining the post-collaboration feature maps in the student model to match the corresponding ones in the teacher model. Second, we propose a matrix-valued edge weight. In such a matrix, each element reflects the inter-agent attention at a specific spatial region, allowing an agent to adaptively highlight informative regions. During inference, we only need to use the student model, named the distilled collaboration network (DiscoNet). Attributed to the teacher-student framework, multiple agents with a shared DiscoNet can collaboratively approach the performance of a hypothetical teacher model with a holistic view. Our approach is validated on V2X-SIM 1.0, a large-scale multi-agent perception dataset that we synthesized using CARLA and SUMO co-simulation. Our quantitative and qualitative experiments on multi-agent 3D object detection show that DiscoNet not only achieves a better performance-bandwidth trade-off than state-of-the-art collaborative perception methods, but also brings a more straightforward design rationale. Our code is available at https://github.com/ai4ce/disconet.
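The matrix-valued edge weight can be sketched as follows, assuming each neighbor's BEV feature is already pose-aligned to the ego agent: a tiny conv net predicts one attention value per spatial location and per edge, and features are fused as an attention-weighted sum. The network shape and names are ours, not DiscoNet's.

```python
import torch
import torch.nn as nn

class MatrixEdgeFusion(nn.Module):
    """Sketch of a matrix-valued edge weight: for every edge (including a
    self-loop), a small conv net predicts one attention value per spatial
    location, and neighbor features are fused by an attention-weighted
    sum. An illustration, not the DiscoNet code."""

    def __init__(self, channels: int):
        super().__init__()
        self.edge = nn.Sequential(
            nn.Conv2d(2 * channels, channels, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, 1, 1),
        )

    def forward(self, ego: torch.Tensor, neighbors: torch.Tensor):
        # ego: (N, C, H, W); neighbors: (N, K, C, H, W), pose-aligned to ego
        n, k, c, h, w = neighbors.shape
        feats = torch.cat([ego.unsqueeze(1), neighbors], dim=1)   # self-loop
        ego_rep = ego.unsqueeze(1).expand(n, k + 1, c, h, w)
        pair = torch.cat([ego_rep, feats], dim=2)
        logits = self.edge(pair.reshape(n * (k + 1), 2 * c, h, w))
        weights = torch.softmax(logits.reshape(n, k + 1, 1, h, w), dim=1)
        return (weights * feats).sum(dim=1)       # fused BEV feature

fuse = MatrixEdgeFusion(channels=64)
ego = torch.randn(2, 64, 32, 32)
nbrs = torch.randn(2, 3, 64, 32, 32)
fused = fuse(ego, nbrs)  # (2, 64, 32, 32)
```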
Masked image modeling (MIM) performs strongly in pre-training large vision Transformers (ViTs). However, small models that are critical for real-world applications cannot or can only marginally benefit from this pre-training approach. In this paper, we explore distillation techniques to transfer the success of large MIM-based pre-trained models to smaller ones. We systematically study different options in the distillation framework, including distillation targets, losses, input, network regularization, sequential distillation, etc., revealing that: 1) Distilling token relations is more effective than CLS-token- and feature-based distillation; 2) Using an intermediate layer of the teacher network as the target performs better than using the last layer when the depth of the student mismatches that of the teacher; 3) Weak regularization is preferred; etc. With these findings, we achieve significant fine-tuning accuracy improvements over scratch MIM pre-training on ImageNet-1K classification, using the ViT-Tiny, ViT-Small, and ViT-Base models, with +4.2%/+2.4%/+1.4% gains, respectively. Our TinyMIM model of base size achieves 52.2 mIoU on ADE20K semantic segmentation, which is +4.1 higher than the MAE baseline. Our TinyMIM model of tiny size achieves 79.6% top-1 accuracy on ImageNet-1K image classification, which sets a new record for small vision models of the same size and computation budget. This strong performance suggests an alternative way of developing small vision Transformer models, that is, by exploring better training methods rather than introducing inductive biases into architectures as in most previous works. Code is available at https://github.com/OliverRensu/TinyMIM.
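Finding 1 (token-relation distillation) can be sketched roughly as below, assuming teacher and student share the token count and head dimension: softmax-normalized Q·K and V·V similarity maps of the teacher are matched by the student via a KL divergence. This is our own illustration of the idea, not the TinyMIM loss.

```python
import torch
import torch.nn.functional as F

def relation_distill_loss(q_s, k_s, v_s, q_t, k_t, v_t, tau: float = 1.0):
    """Match the student's token-to-token relation maps (softmax-normalized
    Q.K^T and V.V^T similarities) to the teacher's, instead of matching raw
    features or the CLS token. All tensors: (B, N_tokens, D); equal D for
    teacher and student is a simplifying assumption."""
    def rel(a, b):
        sim = a @ b.transpose(-2, -1) / (a.size(-1) ** 0.5)
        return F.log_softmax(sim / tau, dim=-1)

    loss_qk = F.kl_div(rel(q_s, k_s), rel(q_t, k_t).exp().detach(),
                       reduction="batchmean")
    loss_vv = F.kl_div(rel(v_s, v_s), rel(v_t, v_t).exp().detach(),
                       reduction="batchmean")
    return loss_qk + loss_vv

# Toy usage with random tensors standing in for Q/K/V of an intermediate
# teacher layer (finding 2) and a student layer.
B, N, D = 2, 197, 192
q_s, k_s, v_s, q_t, k_t, v_t = (torch.randn(B, N, D) for _ in range(6))
loss = relation_distill_loss(q_s, k_s, v_s, q_t, k_t, v_t)
```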
In this paper, we propose a robust 3D detector, named Cross Modal Transformer (CMT), for end-to-end 3D multi-modal detection. Without explicit view transformation, CMT takes the image and point-cloud tokens as inputs and directly outputs accurate 3D bounding boxes. The spatial alignment of multi-modal tokens is performed implicitly, by encoding the 3D points into multi-modal features. The core design of CMT is quite simple while its performance is impressive. CMT obtains 73.0% NDS on the nuScenes benchmark. Moreover, CMT exhibits strong robustness even if the LiDAR is missing. Code will be released at https://github.com/junjie18/CMT.
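A toy sketch of the implicit alignment idea, under our assumptions: 3D coordinates associated with each token are passed through a shared MLP and added to both image and point-cloud tokens as positional embeddings, so a standard transformer can fuse them without an explicit view transformation. This is not the CMT code.

```python
import torch
import torch.nn as nn

class Coords3DEncoding(nn.Module):
    """Shared positional encoding from 3D coordinates: both image tokens
    and point-cloud tokens receive embeddings produced from 3D points by
    the same MLP, implicitly aligning the two modalities in 3D space.
    A toy stand-in, not the CMT implementation."""

    def __init__(self, dim: int):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3, dim), nn.ReLU(inplace=True),
                                 nn.Linear(dim, dim))

    def forward(self, tokens: torch.Tensor, coords3d: torch.Tensor):
        # tokens: (B, N, D); coords3d: (B, N, 3) 3D points associated with
        # each token (e.g., camera-ray samples for image tokens, voxel
        # centers for LiDAR tokens), collapsed to one point here for brevity.
        return tokens + self.mlp(coords3d)

enc = Coords3DEncoding(dim=256)
img_tokens = enc(torch.randn(1, 600, 256), torch.randn(1, 600, 3))
pts_tokens = enc(torch.randn(1, 400, 256), torch.randn(1, 400, 3))
fused = torch.cat([img_tokens, pts_tokens], dim=1)  # (1, 1000, 256) tokens
```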
Dataset distillation has emerged as a prominent technique to improve data efficiency when training machine learning models. It encapsulates the knowledge from a large dataset into a smaller synthetic dataset. A model trained on this smaller distilled dataset can attain comparable performance to a model trained on the original training dataset. However, the existing dataset distillation techniques mainly aim at achieving the best trade-off between resource usage efficiency and model utility. The security risks stemming from them have not been explored. This study performs the first backdoor attack against the models trained on the data distilled by dataset distillation models in the image domain. Concretely, we inject triggers into the synthetic data during the distillation procedure rather than during the model training stage, where all previous attacks are performed. We propose two types of backdoor attacks, namely NAIVEATTACK and DOORPING. NAIVEATTACK simply adds triggers to the raw data at the initial distillation phase, while DOORPING iteratively updates the triggers during the entire distillation procedure. We conduct extensive evaluations on multiple datasets, architectures, and dataset distillation techniques. Empirical evaluation shows that NAIVEATTACK achieves decent attack success rate (ASR) scores in some cases, while DOORPING reaches higher ASR scores (close to 1.0) in all cases. Furthermore, we conduct a comprehensive ablation study to analyze the factors that may affect the attack performance. Finally, we evaluate multiple defense mechanisms against our backdoor attacks and show that our attacks can practically circumvent these defense mechanisms.
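The difference between the two attacks can be sketched as follows; `surrogate`, the trigger update, and the poisoning ratio are hypothetical stand-ins, not the paper's pipeline: NAIVEATTACK stamps a fixed trigger onto raw data once, while DOORPING takes a gradient step on the trigger at every distillation iteration.

```python
import torch

def apply_trigger(images: torch.Tensor, trigger: torch.Tensor) -> torch.Tensor:
    """Paste a small trigger patch into the bottom-right corner."""
    out = images.clone()
    th, tw = trigger.shape[-2:]
    out[..., -th:, -tw:] = trigger
    return out

# NAIVEATTACK (sketch): a fixed trigger is stamped onto part of the raw
# training data once, before distillation starts.
def naive_attack(raw_images, raw_labels, trigger, target_class, ratio=0.1):
    n = int(len(raw_images) * ratio)
    raw_images[:n] = apply_trigger(raw_images[:n], trigger)
    raw_labels[:n] = target_class
    return raw_images, raw_labels

# DOORPING (sketch): the trigger itself is optimized during distillation so
# it survives in the synthetic set; `surrogate` is a hypothetical model
# trained on the current synthetic data, called once per iteration.
def doorping_step(trigger, surrogate, synthetic_images, target_class,
                  lr: float = 0.01):
    trigger = trigger.clone().requires_grad_(True)
    logits = surrogate(apply_trigger(synthetic_images, trigger))
    targets = torch.full((len(synthetic_images),), target_class)
    loss = torch.nn.functional.cross_entropy(logits, targets)
    loss.backward()
    with torch.no_grad():
        trigger -= lr * trigger.grad       # strengthen the backdoor signal
    return trigger.detach().clamp(0.0, 1.0)
```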
Blind image quality assessment (BIQA) remains challenging due to the diversity of distortions and the variation of image content, which complicates distortion patterns across different scales and aggravates the difficulty of the regression problem in BIQA. However, existing BIQA methods often fail to consider multi-scale distortion patterns and image content, and little research has been done on learning strategies that make the regression model perform better. In this paper, we propose a simple yet effective Progressive Multi-Task Image Quality Assessment (PMT-IQA) model, which contains a multi-scale feature extraction module (MS) and a progressive multi-task learning module (PMT), to help the model learn complex distortion patterns and better optimize the regression problem, in line with the easy-to-hard law of the human learning process. To verify the effectiveness of the proposed PMT-IQA model, we conduct experiments on four widely used public datasets. The experimental results indicate that the performance of PMT-IQA is superior to that of the comparison approaches, and that both the MS and PMT modules improve the model's performance.
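A speculative sketch of the two modules under our own assumptions (the paper's exact task pair and schedule may differ): multi-scale pooled features feed two heads, a coarse quality-bin classifier as the easy task and a score regressor as the hard task, with a loss weight that shifts from easy to hard as training progresses.

```python
import torch
import torch.nn as nn

class PMTHead(nn.Module):
    """Toy MS + PMT sketch: features are pooled at several scales and
    concatenated (MS), then two heads learn an easy auxiliary task
    (coarse quality bins) and the harder exact-score regression (PMT).
    The task choices are our assumptions, not the paper's definitions."""

    def __init__(self, channels: int, num_bins: int = 5):
        super().__init__()
        self.pools = nn.ModuleList(
            [nn.AdaptiveAvgPool2d(s) for s in (1, 2, 4)])  # MS module
        feat_dim = channels * (1 + 4 + 16)
        self.classify = nn.Linear(feat_dim, num_bins)  # easy: coarse bins
        self.regress = nn.Linear(feat_dim, 1)          # hard: exact score

    def forward(self, fmap: torch.Tensor):
        ms = torch.cat([p(fmap).flatten(1) for p in self.pools], dim=1)
        return self.classify(ms), self.regress(ms).squeeze(1)

def pmt_loss(logits, score_pred, bin_label, score, progress: float):
    # progress in [0, 1]: weight moves from the easy task to the hard one.
    easy = nn.functional.cross_entropy(logits, bin_label)
    hard = nn.functional.mse_loss(score_pred, score)
    return (1.0 - progress) * easy + progress * hard

head = PMTHead(channels=64)
logits, pred = head(torch.randn(8, 64, 16, 16))
loss = pmt_loss(logits, pred, torch.randint(0, 5, (8,)),
                torch.rand(8), progress=0.3)
```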